When we want to explore the data, it is a good pratice to view top n contents of a file rather than loading the file entirely. Here, we start with exploring the data using helpful functions such as “head”, “tail”, “str”, and “dim”:
salary <- read.csv("./salary_data.csv", strip.white = T, na.strings = "?")
# View the first 6 rows of data
head(salary)
# View the last 6 rows of data
tail(salary)
# View a condensed summary of the data
str(salary) # Do we need this ? We are doing this again below
## 'data.frame': 32561 obs. of 15 variables:
## $ age : int 39 50 38 53 28 37 49 52 31 42 ...
## $ workclass : Factor w/ 8 levels "Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
## $ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
## $ education : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
## $ education.num : int 13 13 9 7 13 14 5 9 14 13 ...
## $ martital.status: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
## $ occupation : Factor w/ 14 levels "Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
## $ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
## $ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
## $ capital.gain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
## $ capital.loss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hours.per.week : int 40 13 40 40 40 40 16 45 50 40 ...
## $ native.country : Factor w/ 41 levels "Cambodia","Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
## $ salary.class : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
# Check the class of salary
class(salary)
## [1] "data.frame"
# Check the dimensions of salary
dim(salary)
## [1] 32561 15
# View the column names of salary
colnames(salary)
## [1] "age" "workclass" "fnlwgt" "education"
## [5] "education.num" "martital.status" "occupation" "relationship"
## [9] "race" "sex" "capital.gain" "capital.loss"
## [13] "hours.per.week" "native.country" "salary.class"
It is recommended to always understand the structure of the data you are working with and displaying the summary of the data. To view the structure of the data we can use the following functions:
# Check the structure of salary
str(salary)
## 'data.frame': 32561 obs. of 15 variables:
## $ age : int 39 50 38 53 28 37 49 52 31 42 ...
## $ workclass : Factor w/ 8 levels "Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
## $ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
## $ education : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
## $ education.num : int 13 13 9 7 13 14 5 9 14 13 ...
## $ martital.status: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
## $ occupation : Factor w/ 14 levels "Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
## $ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
## $ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
## $ capital.gain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
## $ capital.loss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hours.per.week : int 40 13 40 40 40 40 16 45 50 40 ...
## $ native.country : Factor w/ 41 levels "Cambodia","Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
## $ salary.class : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
# View a summary of salary
summary(salary)
## age workclass fnlwgt
## Min. :17.00 Private :22696 Min. : 12285
## 1st Qu.:28.00 Self-emp-not-inc: 2541 1st Qu.: 117827
## Median :37.00 Local-gov : 2093 Median : 178356
## Mean :38.58 State-gov : 1298 Mean : 189778
## 3rd Qu.:48.00 Self-emp-inc : 1116 3rd Qu.: 237051
## Max. :90.00 (Other) : 981 Max. :1484705
## NA's : 1836
## education education.num martital.status
## HS-grad :10501 Min. : 1.00 Divorced : 4443
## Some-college: 7291 1st Qu.: 9.00 Married-AF-spouse : 23
## Bachelors : 5355 Median :10.00 Married-civ-spouse :14976
## Masters : 1723 Mean :10.08 Married-spouse-absent: 418
## Assoc-voc : 1382 3rd Qu.:12.00 Never-married :10683
## 11th : 1175 Max. :16.00 Separated : 1025
## (Other) : 5134 Widowed : 993
## occupation relationship race
## Prof-specialty : 4140 Husband :13193 Amer-Indian-Eskimo: 311
## Craft-repair : 4099 Not-in-family : 8305 Asian-Pac-Islander: 1039
## Exec-managerial: 4066 Other-relative: 981 Black : 3124
## Adm-clerical : 3770 Own-child : 5068 Other : 271
## Sales : 3650 Unmarried : 3446 White :27816
## (Other) :10993 Wife : 1568
## NA's : 1843
## sex capital.gain capital.loss hours.per.week
## Female:10771 Min. : 0 Min. : 0.0 Min. : 1.00
## Male :21790 1st Qu.: 0 1st Qu.: 0.0 1st Qu.:40.00
## Median : 0 Median : 0.0 Median :40.00
## Mean : 1078 Mean : 87.3 Mean :40.44
## 3rd Qu.: 0 3rd Qu.: 0.0 3rd Qu.:45.00
## Max. :99999 Max. :4356.0 Max. :99.00
##
## native.country salary.class
## United-States:29170 <=50K:24720
## Mexico : 643 >50K : 7841
## Philippines : 198
## Germany : 137
## Canada : 121
## (Other) : 1709
## NA's : 583
One of the most useful R packages in the majority of data science projects is “tidyverse”. This package enables us to use our data as an object, and explore/ process the dataset using various attributes. Basically, “tidyverse” helps us to treat our inputs (datasets, variables, vectors, …) as objects. This is quite important because R is a functional programming language rather than an object-oriented programming language (such as Python). Hence, “tidyverse” enables us to benefit from object-oriented programming style in R. This would make it very easy to explore, process, clean, and change the data in R. Please note that “tidyverse” is a collection of packages. When we load this package, we are actually loading several packages that are available within “tidyverse”. These packages within tidyverse() such as “dplyr” help us solve the most common data manipulation challenges very easily. The main attributes in “dplyr” include:
Below, we use these attributes to explore the data and perform common data wrangling tasks. But first, we use function glimpse() to get a glimse of the data. The output of this function is similar to str():
# Load tidyverse
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.6.2
## -- Attaching packages --------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## Warning: package 'forcats' was built under R version 3.6.2
## -- Conflicts ------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# Check the structure of salary, the dplyr way
glimpse(salary)
## Observations: 32,561
## Variables: 15
## $ age <int> 39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 37, 30, 23,...
## $ workclass <fct> State-gov, Self-emp-not-inc, Private, Private, Priv...
## $ fnlwgt <int> 77516, 83311, 215646, 234721, 338409, 284582, 16018...
## $ education <fct> Bachelors, Bachelors, HS-grad, 11th, Bachelors, Mas...
## $ education.num <int> 13, 13, 9, 7, 13, 14, 5, 9, 14, 13, 10, 13, 13, 12,...
## $ martital.status <fct> Never-married, Married-civ-spouse, Divorced, Marrie...
## $ occupation <fct> Adm-clerical, Exec-managerial, Handlers-cleaners, H...
## $ relationship <fct> Not-in-family, Husband, Not-in-family, Husband, Wif...
## $ race <fct> White, White, White, Black, Black, White, Black, Wh...
## $ sex <fct> Male, Male, Male, Male, Female, Female, Female, Mal...
## $ capital.gain <int> 2174, 0, 0, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0, 0,...
## $ capital.loss <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ hours.per.week <int> 40, 13, 40, 40, 40, 40, 16, 45, 50, 40, 80, 40, 30,...
## $ native.country <fct> United-States, United-States, United-States, United...
## $ salary.class <fct> <=50K, <=50K, <=50K, <=50K, <=50K, <=50K, <=50K, >5...
Before working on the examples, let’s just use “tidyverse” to remove all the observations with any missing values from our data:
# Removing entries with missing/ blank values from data:
salary <- salary %>%
drop_na()
Pipes take the output from one function and feed it to the first argument of the next function. You may have encountered the Unix pipe | before. The R language allows symbols wrapped in % to be defined as functions, the > helps imply a chain. The magrittr R package contains the pipe function %>%. In the exmple above salary is being piped into the function drop_na.
filter() allows us to select a subset of rows in a data frame. To use “dplyr” functions, we first mention the name of the data (a data frame or a tibble object), followed by %>%, and then followed by filter().
We add the conditions for filtering inside the function filter(). Here is an example to filter the rows based on education value “Masters”:
salary %>%
filter(education == "Masters")
If there are multiple conditions for filtering, we separate them using “,”. Below we demonstrate how we can combine filter for education and salary.class. The below code enables us to look at data with Masters degree earning more than 50K:
salary %>%
filter(education == "Masters", salary.class == ">50K")
In many cases, the dataset contains more columns than needed. Or in some occasions, we create new columns based on the original columns and won’t need the original ones anymore. select() allows us to quickly zoom in on a useful subset of the data:
salary %>%
select(age, workclass, education, hours.per.week, salary.class)
We can combine select() and filter() separated by %>%. In the following example, we only select columns age, workclass, education, hours.per.week, and salary.class. We also filter the rows to only include observations that have a value greater than 50 for column/ variable age:
salary %>%
select(age, workclass, education, hours.per.week, salary.class) %>%
filter(age > 50)
There are a number of helper functions we can use with select() , like starts_with(), ends_with(), matches() and contains(). These functions let us quickly match larger blocks of variables that meet some criterion. See ?select for more details. In the following example, we only select columns that start with “edu”:
salary %>%
select(starts_with("edu"))
Here, we only select columns that contain “class”:
salary %>%
select(contains("class"))
Please note that select() can be used to remove variables too. For instance, we can remove variables like this:
salary <- salary %>%
select(-c(variable1ToRemove, variable2ToRemove))
arrange() works similarly to filter() except that instead of filtering rows, it reorders them. It takes a data frame, and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns. In the following example, we arrange the data salary based on columns age and education.num:
salary %>%
arrange(age, education.num)
We can apply arrange in combination with other functions:
salary %>%
select(age, workclass, education.num, hours.per.week, salary.class) %>%
filter(age > 50) %>%
arrange(age, education.num)
We can use desc() to order a column in descending order:
salary %>%
select(age, workclass, education.num, hours.per.week, salary.class) %>%
filter(age > 50) %>%
arrange(desc(age), education.num)
Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. This is the job of mutate(). In the following example, we create a new column which we call log_age which is the log-transformed of variable age:
salary %>%
mutate(log_age = log(age))
If you only want to keep the new variables, use transmute(). Below code shows how we keep log_age and log_edu instead of age abd education.num:
salary %>%
transmute(log_age = log(age), log_edu = log(education.num))
Notice that only the newly created columns are in the output.
We can use mutate_at to convert data from one type to another. For instance, imagine we want to convert some numeric and character variables to factors. Here is how to do it:
# Convert categorical variables to factors:
salary <- salary %>%
mutate_at(c("numericVariable1","numericVariable2","characterVariable1",
"characterVariable1"), factor)
Common type conversions that we can use include: - as.character - as.numeric - as.integer - as.factor - as.logical
summarise() collapses a data frame to a single row:
salary %>%
summarise(mean(age, na.rm = T))
Note: na.rm = TRUE automatically removes the missing values when performing the operation.
salary %>%
summarise(mean(age, na.rm = T), sd(age, na.rm = T), mean(education.num, na.rm = T),
sd(education.num, na.rm = T))
Later on, we learn how to use summarise() along with group_by().
In many real data science projects, we need to randomly sample the data (for instance, we usually train and test the models over different samples of the data). We can use sample_n() and sample_frac() to take a random sample of rows: use sample_n() for a fixed number and sample_frac() for a fixed fraction.
salary %>%
sample_n(16)
salary %>%
sample_frac(.01)
It is very common to join two or more datasets to create a new dataset in a data science project. For instance, we could have two datasets each recorded in a different database. For building our model though, we may need to use the variables (columns) from both of those two datasets. To join datasets together, we need to find a unique identifier (key) in both datasets. For instance, student_id can be used to join a dataset about students’ academic performance to a dataset related to students’ financial aid. Once we have the key, we can use the following format to join the datasets:
new_data <- dataset1 %>%
inner_join(dataset2, by = key)
Now, let’s first create two datasets from our salary data and then join those two datasets together to recreate the original salary data. But first, we create a unique identifier which later we use as the key to rejoin the datasets:
salary <- salary %>%
mutate(id = row_number())
head(salary)
We then create the two datasets:
salary1 <- salary %>%
select(id,age,workclass,fnlwgt,education,education.num,martital.status,occupation)
head(salary1)
salary2 <- salary %>%
select(id,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,salary.class)
head(salary2)
Now we join them again to create the original data:
salary <- salary1 %>%
inner_join(salary2, by = "id")
head(salary)
Below, we can learn more about the different join types and also filtering joins as well as nesting joins(from https://dplyr.tidyverse.org/reference/join.html):
Currently dplyr supports four types of mutating joins, two types of filtering joins, and a nesting join.
Mutating joins combine variables from the two data.frames:
inner_join() return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.
left_join() return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
right_join() return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
full_join() return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.
semi_join() return all rows from x where there are matching values in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x.
anti_join() return all rows from x where there are not matching values in y, keeping just columns from x.
The “dplyr” functions are useful on their own, but they become even more powerful when we apply them to groups of observations within a dataset. In “dplyr”, we do this with group_by() function. It breaks down a dataset into specified groups of rows. We then apply the other functions above on the resulting object they’ll be automatically applied by groups.
Grouping affects the functions as follows:
grouped select() is the same as ungrouped select(), except that grouping variables are always retained.
grouped arrange() is the same as ungrouped; unless we set .by_group = TRUE, in which case it orders first by the grouping variables
mutate() and filter() are most useful in conjunction with window functions (like rank(), or min(x) == x).
sample_n() and sample_frac() sample the specified number/fraction of rows in each group.
summarise() computes the summary for each group.
In the following example, we split the dataset based on education into groups and then summarize each group by counting the number of observations (count = n()) and computing the average age (mean_age = mean(age, na.rm = TRUE)) in each group:
salary %>%
group_by(education) %>%
summarise(count = n(), mean_age = mean(age, na.rm = TRUE))
we can also group observations based on multiple columns:
salary %>%
group_by(education, sex) %>%
summarise(count = n(), mean_age = mean(age, na.rm = TRUE))
There are many other operations we can do using “dplyr”. Please refer to our references (https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html & https://dplyr.tidyverse.org) to learn more.
“tidyr” is another package within “tidyverse”. Here we use this package to further work with the data:
Here we really just use
gather combines multiple columns into two rows with key and valuespread moves key value columns to multiple columns with keys as column namesseparate splits a column by _ or whatever separator you choose to multiple columnsunite combines multiple columns to one column with _ as the separator#Load dairy data
dairy_clean <- read.csv("dairy_clean.csv", strip.white = T, na.strings = "")
head(dairy_clean)
# Apply gather() to dairy_clean and save the result as dairy_long
dairy_long <- gather(dairy_clean, YEAR, MILK_val, -FARM)
# View the first 20 rows of the result
head(dairy_long,20)
# Apply spread() to dairy_long
dairy_wide <- spread(dairy_long, YEAR, MILK_val)
# View the head of bmi_wide
head(dairy_wide)
# Apply unite() to dairy
dairy <- read.csv("dairy.csv", strip.white = T, na.strings = "")
dairy_unite <- unite(dairy, FARM_YEAR, FARM, YEAR, sep = "-")
# View the head of the result
head(dairy_unite)
# Apply separate() to dairy_unite
dairy_seperate <- separate(dairy_unite, col = FARM_YEAR, into = c("FARM", "YEAR"), sep = "-")
# Print the head of the result
head(dairy_seperate)
A complete cheat sheet of “tidyverse” is given below:
Tidyverse cheatsheet Part 1
Tidyverse cheatsheet Part 2
# Load lubridate package
library(lubridate)
## Warning: package 'lubridate' was built under R version 3.6.2
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
# Parse as date
dmy("17 Sep 2015")
## [1] "2015-09-17"
# Parse as date and time (with no seconds!)
mdy_hm("July 15, 2012 12:56")
## [1] "2012-07-15 12:56:00 UTC"
str_trim, str_pad, str_detect, str_replacestringr replaces the disparate base r functions for working with stings
paste, paste0, gsub, grep, etc#Load stringr package
library(stringr)
# Trim all leading and trailing whitespace
str_trim(c(" Filip ", "Nick ", " Jonathan"))
## [1] "Filip" "Nick" "Jonathan"
# Pad these strings with leading zeros
str_pad(c("23485W", "8823453Q", "994Z"), width=9, side='left', pad="0")
## [1] "00023485W" "08823453Q" "00000994Z"
states <- c("al", "ak", "az", "ar", "ca", "co", "ct", "de", "fl", "ga",
"hi", "id", "il", "in", "ia", "ks", "ky", "la", "me", "md", "ma",
"mi", "mn", "ms", "mo", "mt", "ne", "nv", "nh", "nj", "nm", "ny",
"nc", "nd", "oh", "ok", "or", "pa", "ri", "sc", "sd", "tn", "tx",
"ut", "vt", "va", "wa", "wv", "wi", "wy")
# Make states all uppercase and save result to states_upper
states_upper <- toupper(states)
states_upper
## [1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "ID" "IL" "IN" "IA"
## [16] "KS" "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE" "NV" "NH" "NJ"
## [31] "NM" "NY" "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VT"
## [46] "VA" "WA" "WV" "WI" "WY"
# Make states_upper all lowercase again
states_lower <- tolower(states_upper)
states_lower
## [1] "al" "ak" "az" "ar" "ca" "co" "ct" "de" "fl" "ga" "hi" "id" "il" "in" "ia"
## [16] "ks" "ky" "la" "me" "md" "ma" "mi" "mn" "ms" "mo" "mt" "ne" "nv" "nh" "nj"
## [31] "nm" "ny" "nc" "nd" "oh" "ok" "or" "pa" "ri" "sc" "sd" "tn" "tx" "ut" "vt"
## [46] "va" "wa" "wv" "wi" "wy"
# Look at the head of salary
head(salary)
# Detect all native.country with United-States
str_detect(salary$native.country, "United-States")
# In the sex column, replace "Female" with "F"...
salary$sex <- str_replace(salary$sex, "Female", "F")
# ...And "Male" with "M"
salary$sex <- str_replace(salary$sex, "Male", "M")
# View the head of salary
head(salary)
salary2 <- read.csv("salary_subset.csv", strip.white = T, na.strings = "?")
# Call is.na() on the full salary2 to spot all NAs
is.na(salary2)
# Use the any() function to ask whether there are any NAs in the data
any(is.na(salary2))
## [1] TRUE
# View a summary() of the dataset
summary(salary2)
## age workclass fnlwgt education
## Min. :19.00 : 1 Min. : 28887 HS-grad :13
## 1st Qu.:30.00 Federal-gov : 3 1st Qu.:122147 Bachelors :11
## Median :38.50 Local-gov : 2 Median :192038 Some-college: 7
## Mean :38.42 Private :33 Mean :205162 11th : 4
## 3rd Qu.:49.00 Self-emp-not-inc: 6 3rd Qu.:269627 Masters : 4
## Max. :59.00 State-gov : 3 Max. :544091 9th : 3
## (Other) : 6
## education.num martital.status occupation
## Min. : 4.00 Divorced : 6 Exec-managerial : 9
## 1st Qu.: 9.00 Married-AF-spouse : 1 Prof-specialty : 7
## Median :10.00 Married-civ-spouse :26 Adm-clerical : 5
## Mean :10.33 Married-spouse-absent: 1 Machine-op-inspct: 5
## 3rd Qu.:13.00 Never-married :12 Sales : 4
## Max. :16.00 Separated : 2 (Other) :17
## NA's : 1
## relationship race sex capital.gain
## Husband :23 Amer-Indian-Eskimo: 1 Female:11 Min. : 0.0
## Not-in-family: 8 Asian-Pac-Islander: 3 Male :37 1st Qu.: 0.0
## Own-child : 7 Black :10 Median : 0.0
## Unmarried : 7 White :34 Mean : 446.6
## Wife : 3 3rd Qu.: 0.0
## Max. :14084.0
##
## capital.loss hours.per.week native.country salary.class
## Min. : 0.00 Min. :13.00 : 6 <=50K:36
## 1st Qu.: 0.00 1st Qu.:40.00 Cuba : 1 >50K :12
## Median : 0.00 Median :40.00 India : 1
## Mean : 71.88 Mean :41.23 Jamaica : 1
## 3rd Qu.: 0.00 3rd Qu.:44.25 Mexico : 1
## Max. :2042.00 Max. :80.00 Puerto-Rico : 1
## United-States:37
# Call table() on the native.country column
table(salary2$native.country)
##
## Cuba India Jamaica Mexico
## 6 1 1 1 1
## Puerto-Rico United-States
## 1 37
# Replace all empty strings in status with NA
salary2$native.country[salary2$native.country == ""] <- NA
# Print salary2 to the console
salary2
# Use complete.cases() to see which rows have no missing values
complete.cases(salary2)
## [1] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
## [13] FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [25] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [37] TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# Use na.omit() to remove all rows with any missing values
na.omit(salary2)
summary and visualizations like hist or boxplot are good for finding these# Look at a summary() of salary
summary(salary)
## id age workclass fnlwgt
## Min. : 1 Min. :17.00 Private :22286 Min. : 13769
## 1st Qu.: 7541 1st Qu.:28.00 Self-emp-not-inc: 2499 1st Qu.: 117627
## Median :15082 Median :37.00 Local-gov : 2067 Median : 178425
## Mean :15082 Mean :38.44 State-gov : 1279 Mean : 189794
## 3rd Qu.:22622 3rd Qu.:47.00 Self-emp-inc : 1074 3rd Qu.: 237629
## Max. :30162 Max. :90.00 Federal-gov : 943 Max. :1484705
## (Other) : 14
## education education.num martital.status
## HS-grad :9840 Min. : 1.00 Divorced : 4214
## Some-college:6678 1st Qu.: 9.00 Married-AF-spouse : 21
## Bachelors :5044 Median :10.00 Married-civ-spouse :14065
## Masters :1627 Mean :10.12 Married-spouse-absent: 370
## Assoc-voc :1307 3rd Qu.:13.00 Never-married : 9726
## 11th :1048 Max. :16.00 Separated : 939
## (Other) :4618 Widowed : 827
## occupation relationship race
## Prof-specialty :4038 Husband :12463 Amer-Indian-Eskimo: 286
## Craft-repair :4030 Not-in-family : 7726 Asian-Pac-Islander: 895
## Exec-managerial:3992 Other-relative: 889 Black : 2817
## Adm-clerical :3721 Own-child : 4466 Other : 231
## Sales :3584 Unmarried : 3212 White :25933
## Other-service :3212 Wife : 1406
## (Other) :7585
## sex capital.gain capital.loss hours.per.week
## Length:30162 Min. : 0 Min. : 0.00 Min. : 1.00
## Class :character 1st Qu.: 0 1st Qu.: 0.00 1st Qu.:40.00
## Mode :character Median : 0 Median : 0.00 Median :40.00
## Mean : 1092 Mean : 88.37 Mean :40.93
## 3rd Qu.: 0 3rd Qu.: 0.00 3rd Qu.:45.00
## Max. :99999 Max. :4356.00 Max. :99.00
##
## native.country salary.class
## United-States:27504 <=50K:22654
## Mexico : 610 >50K : 7508
## Philippines : 188
## Germany : 128
## Puerto-Rico : 109
## Canada : 107
## (Other) : 1516
# View a histogram of the age variable
hist(salary$age)
# View a histogram of the hours.per.week
hist(salary$hours.per.week)
# View a boxplot of age
boxplot(salary$age)
# View a boxplot of hours.per.week
boxplot(salary$hours.per.week)
We can also create a boxplot to compare the distribution of a numeric variable with respect to a categorical variables:
# View a boxplot of hours.per.week by sex
boxplot(salary$hours.per.week ~ salary$sex)
Below, we can find a collection of useful Functions that help us navigate the data in r:
head(data_csv, n=4) # Returns the first 4 rows of the data frame data_csv
View(data_csv) # Opens the data frame data_csv in R
colnames(data_csv) # Returns the column names in data_csv
nrow(data_csv) # Returns the number of rows in data_csv
ncol(data_csv) # Returns the number of columns in data_csv
dim(data_csv) # Returns the dimensions of data_csv
dim(data_csv)[2]
typeof(data_csv$age) # Returns the type of variable “age”
length(data_csv$age) # Returns the number of rows in vector “age”
mean(data_csv$age) # Returns the mean of variable “age”
sd(data_csv$age) # Returns the standard deviation of variable “age”
summary(data_csv$age) # Returns the summary of variable “age”
hist(data_csv$age) # Returns the histogram of variable “age”
unique(data_csv$age) # Returns all unique values of variable "age"
sum(is.na(data_csv$workclass)) # Returns the number of missing values in variable "workclass"